# CS3350B Computer Organization Chapter 4: Instruction-Level Parallelism Part 1: Pipelining

Iqra Batool

Department of Computer Science University of Western Ontario, Canada

Wednesday February 28, 2024

#### Outline

- 1 Overview
- 2 Pipelining: An Analogy
- 3 Pipelining For Performance

#### Instruction-Level Parallelism

For a computer architecture, its **instruction-level parallelism** (ILP) is a measure of the number of instructions it can perform simultaneously.

ILP is usually achieved *dynamically*—after compile time—by the processor itself manipulating program execution.

Circuitry (and appropriate control signals) needs to be added to the processor to handle the execution of many instructions simultaneously and to handle the dynamic nature of ILP.

# Achieving ILP

ILP can be achieved in many ways. Some topics we will look at:

- Pipelining
- Superscalar execution
- VLIW very long instruction word
- Register renaming
- Branch prediction

#### Lesson Goals

- Single-cycle vs Multi-cycle vs Pipelined data path.
- Understand pipelining as a concept to achieve parallelism.
- Understand how circuitry is used to:
  - □ Enable multi-cycle datapaths; in turn enabling pipelining.
  - □ Decrease minimum clock cycle time.
- Pipelining metrics: latency, throughput, ideal speedup, actual speedup.

### "Pipelining" in Combinational Circuits



Break up a combinational circuit, reduce propagation delay, insert a register to store intermediate results, increase clock frequency.

### Pipe, Pipeline, Pipelining

Unix pipe: pass data from one program to another.

**Data pipeline:** a sequential series of processing elements (CPUs, circuits, programs, etc.) where the output of one is passed as the input to another. Buffer storage is needed between elements to store temporary data.

**Pipelining:** a technique for instruction-level parallelism where each stage of the datapath is always kept busy. Instructions are **overlapped**.

### Pipelining the RISC Datapath

| IF         | ID | EX | MEM | WB  |     |     |     |    |
|------------|----|----|-----|-----|-----|-----|-----|----|
| ↓ <i>i</i> | IF | D  | EX  | MEM | WB  |     |     |    |
| <i>t</i> → |    | IF | ID  | EX  | MEM | WB  |     |    |
|            |    |    | IF  | ID  | EX  | MEM | WB  |    |
|            |    |    |     | IF  | ID  | EX  | MEM | WB |

- Each stage is executing a *different* instruction.
- $\blacksquare$  5 stages  $\Longrightarrow$  5 instructions executed at once.

#### Outline

- 1 Overview
- 2 Pipelining: An Analogy
- 3 Pipelining For Performance

### Doing Laundry











- We have 4 loads of laundry to do: A, B, C, D.
- To process each load we need to:

  - □ Dry
  - →
     Fold
- Each stage of doing laundry takes 30 minutes.
- Could process each load sequentially or use pipelining.

# Doing Laundry: Sequentially



- Each load of laundry is done one at a time:
  - → Wash A, Dry A, Fold A, Put-away A.
  - Wash B, Dry B, Fold B, Put-away B. ⋮
- Takes 8 hours in total. There has to be a better way.

Wednesday February 28, 2024

### Doing Laundry: Pipeline



- Each stage of doing laundry must process each load sequentially.
- **But** each load of laundry can overlap.
- No dependency between drying load A and washing load B, etc.
- Put-away A while Folding B while drying C while washing D.
- Takes 3.5 hours in total.

### Pipelining Terms via Analogy



- Pipelining: many tasks (loads of laundry) being executed simultaneously using different resources (washer, dryer, etc.).
- Time to complete a single task (latency) does not change.
  - ⇒ Each load by itself still takes 2 hours.
- Number of tasks that can be completed in one unit of time (throughput) increases.
- Potential speed up via pipelining equals the number of stages in pipeline.
- Actual speed-up never exactly equals potential.

# Pipelining Terms via Analogy



- Actual speed-up never exactly equals potential.
- Fill time: time taken to "fill" the pipeline. Initially, not every stage is used.
- Drain time: time taken to "empty" the pipeline. Not all stages are used once the last task begins.
- Imagine a new washing machine takes only 20 minutes. This *does not* increase pipeline speed.
  - □ Dryer still takes 30 minutes.
  - Washer must wait for dryer to finish before laundry can move from washer to dryer.

#### Outline

- 1 Overview
- 2 Pipelining: An Analogy
- 3 Pipelining For Performance

# The RISC Datapath



### Review: Single Cycle Datapath



- Clock cycle is long enough to handle *critical path* through datapath.
- Time for data to pass through entire datapath.

# Performance of Single Cycle Datapath

- Let's assume that accessing memory takes 200ps and ALU propagation delay is 200ps.
  - → IF stage, EX stage, MEM stage.
- Let's assume accessing registers takes 100ps.

If there is no lu leve,
the minimal clock cycle is

#### ■ What is the minimum clock cycle?

Sum of all stages since some instructions use all stages.

$$\rightarrow$$
 200 + 100 + 200 + 200 + 100 = 800ps.

| Instr. | IF    | ID    | EX    | MEM   | WB    | Total |
|--------|-------|-------|-------|-------|-------|-------|
| R-type | 200ps | 100ps | 200ps | _     | 100ps | 600ps |
| Branch | 200ps | 100ps | 200ps | -     | -     | 500ps |
| SW     | 200ps | 100ps | 200ps | 200ps | -     | 700ps |
| > Iw   | 200ps | 100ps | 200ps | 200ps | 100ps | 800ps |

determine ->
the minimum

MCC: Min (2, 3, sw, hw).

Wednesday February 28, 2024

#### Improving Performance of Datapath

- Clock frequency
- Parallel execution of instructions via overlap: **pipelining**.
- Superscalar, VLIW (to come later).
- Branch prediction (to come later).



#### Review: Multi-Cycle for Combinational Circuits



Break up a combinational circuit, reduce propagation delay, insert a register to store intermediate results, increase clock frequency.

# Multi-Cycle Datapath for MIPS



# Multi-Cycle Datapath: pre-reg for popularing.



- Clock cycle is long enough to handle <u>slowest stage</u> of the pipeline.
- Time for data to pass through one (the slowest) stage of pipeline.

Example: Minimum clock cycle is 200ps.

|        | 1/    |       | 1     |       |       |       |
|--------|-------|-------|-------|-------|-------|-------|
| Instr. | IF    | ID    | EX    | MEM   | WB    | Total |
| R-type | 200ps | 100ps | 200ps | _     | 100ps | 600ps |
| Branch | 200ps | 100ps | 200ps | _     | -     | 500ps |
| SW     | 200ps | 100ps | 200ps | 200ps | _     | 700ps |
| lw     | 200ps | 100ps | 200ps | 200ps | 100ps | 800ps |

in is not overlapping

### Pipelining for Performance

- Further increase clock frequency?
- Could break up datapath into more and more stages but...

  - $\rightarrow$  More complexity in datapath and controller design  $\Rightarrow$  overhead.
  - Still limited by slowest stage (memory).
- Leverage the parallelism gained by pipelining.
- Parallelism in execution of instructions yields fewer cycles per instruction (CPI)

#### The Classic Performance Equation

#### RISC Pipeline Performance CPI 1.

- Overlap instructions, start the next before the former completes.
- Some instructions will "waste" a cycle as they flow through unused stages.

|     | Cycle 1 | Cycle 2 | Cycle 3 | Cycle 4 | Cycle 5 | Cycle 6 | Cycle 7 | Cycle 8 |
|-----|---------|---------|---------|---------|---------|---------|---------|---------|
| lw  | IFetch  | Dec     | Exec    | Mem     | WB      |         |         |         |
| sw  |         | IFetch  | Dec     | Exec    | Mem     | WB      |         |         |
| add |         |         | IFetch  | Dec     | Exec    | Mem     | WB      |         |

Latency: time to complete one instruction. Does not decrease with red pipelining. May actually increase slightly if each stage originally did not take the same amount of time.

- Throughput: number of instructions that can be completed in some amount of time. Increases with pipelining.
- Once pipeline is full CPI is 1. Cidently).

Wednesday February 28, 2024

# Performance: With and Without Pipelining



#### Pipeline Parallelism



- Potential speed-up via parallelism is equal to the number of stages.
- 5 stages  $\Longrightarrow$  5x potential speed up.
- A pipeline is "full" when every stage is occupied by an instruction (every stage does not have to necessarily be doing work).
- Pipeline fill time and drain time reduce actual speed up.

# Quantifying Pipelined Speedup

If the time for each stage is the same:

If the time for each stage is not the same:

$$\label{eq:cycle} Ideal\ Speedup = \frac{Time\ between\ instructions_{single-cycle}}{Time\ between\ instructions_{pipelined}}$$

Note that a single-cycle datapath always has latency less than or equal to a multi-cycle/pipelined datapath.

#### Calculating Speedup

#### From previous example:

- Single-cycle datapath: 800ps clock cycle.
- Pipelined: 200ps clock cycle.
- Uneven time for each stage. ID and WB only 100ps.
- 3 lw instructions. as number instructions increase

  Fill time & drain time ignored.

$$\mbox{Ideal Speedup} = \frac{800}{200} = 4 \qquad \qquad \mbox{Actual Speedup} = \frac{2400}{1400} = 1.714$$

■ If we have 1000000 lw instructions?

Actual Speedup = 
$$\frac{1000000 \times 800}{1000000 \times 200 + 800} \approx 4$$

Wednesday February 28, 2024

#### Calculating Pipelined Time

Classic Performance Equation:

$$CPU time = Instruction\_count \times CPI \times clock cycle$$

#### Time for pipelined execution:

$$Time_{pipelined} = Fill time + (IC \times clock cycle)$$

- (Assuming no stalls or hazards.)
- Once pipeline is full, one instr. completes every cycle  $\Rightarrow$  CPI is 1.
  - $\rightarrow$  Gives IC  $\times$  1  $\times$  clock cycle
- Pipeline is only not full during fill or drain time.
- lacksquare Fill time = Drain time = (number of stages 1) imes clock cycle
  - → Assuming number of instructions > number of stages.



#### Calculating Pipelined Time



#### Summary

- Pipelining is the simultaneous execution of multiple instructions each in a different stage of the datapath.
- Pipelining gives increased clock frequency by multi-cycle datapath.
- Limited by the slowest stage.
- Pipelining gives essentially a CPI of 1.
- Speed-up must account for fill time and drain time.
- All of the discussion so far assumed there is **no conflicts** between instructions, hardware, circuits, etc.
  - → Pipeline hazards severely impact performance and potential speed-up.